ABSTRACT

A new formalism is given for read-modify-write (RMW) synchronization operations.
This formalism is used to extend the memory reference combining mechanism,
introduced in the NYU Ultracomputer, to arbitrary RMW operations. A formal
correctness proof of this combining mechanism is given. General requirements
for the practicality of combining are discussed. Combining is shown to be practical
for many useful memory access operations. This includes memory updates of the
form mem_val := mem_val op val, where op need not be associative, and a variety
of synchronization primitives. The computation involved is shown to be closely
related to parallel prefix evaluation.

1.   Introduction 

Shared memory provides convenient communication between processes in a tightly
coupled multiprocessing system. Shared variables can be used for data sharing, information
transfer between processes, and, in particular, for coordination and synchronization.
Constructs such as the semaphore introduced by Dijkstra in [Di], and the many variants
that followed, provide convenient solutions to many synchronization problems involving
an arbitrary number of processes. These constructs are supported in hardware by machine
instructions that atomically execute a Read-Modify-Write cycle. Such instructions exist on
most modern CPUs.

An atomic Read-Modify-Write operation need only be semantically atomic,
although in practice it is often also processed atomically. The "serial bottleneck" created
by this atomic processing, while acceptable for small scale parallelism, can seriously impair
the performance of a system with thousands of processors.

Frequent accesses to a shared variable not only slow down those processes performing
the access, but may cause the entire machine to thrash. Large-scale shared memory parallel
processors are likely to use multistage packet switched interconnection networks for
processor to memory traffic. These networks provide high bandwidth and short latency
time when memory accesses are distributed randomly, but, if even a small percentage of
the memory requests are directed to one specific spot, the network becomes congested and
performance quickly degrades. A recent study by Pfister and Norton [PN] shows that not
only those processors attempting to access the same "hot spot" are delayed, but also the
remaining processors. Although replication of data can often be used to circumvent the hot
spot problem for read-only data, it cannot be used for synchronization variables.


The performance degradation can be mitigated by a memory request "combining"
technique (which will be described later). Briefly, combining works as follows: When a
"conflict" occurs within the network for the same switch output port for memory requests
directed to the same location, a new combined request that represents the conflicting
requests is created. Separate replies to the original requests are later created from the
reply to the combined request. The logic for combining and uncombining memory references
is distributed throughout the processor to memory interconnection network.

It is worth emphasizing that such simultaneous requests directed at the same
memory cell are not random, rare events. When processed in an efficient manner, they
can form the basis for a completely parallel, decentralized operating system as well as a
building block for efficient parallel programming constructs. A general discussion of the
cost/performance tradeoffs of the combining mechanism has been given elsewhere.

Indeed,  such  a  mechanism  was  proposed  for  read  requests  in  the  CHoPP  machine 
[SBK].  It  was  extended  to  handle  write  requests,  and  some  types  of  Read-Modify-Write 
requests  [Ru]  and  further  generalized  for  associative  Read-Modify-Write  operations  [GK]. 
These  ideas  are  used  to  implement  concurrent  reads,  writes,  and  "Fetch-and-Adds"  in  the 
NYU  Ultracomputer  [GGK]  and  IBM  RP3  [PBH]  machines. 

The semantics of serial processes are well understood; it is relatively easy to argue about
the correctness of serial computers. The situation is quite different for parallel systems:
Satisfactory definitions of their semantics have only recently evolved ([LyF], [La1], [La2],
[La3]) and our intuition often fails when trying to formally reason about parallel systems.
Therefore, it is important to precisely define correctness criteria for parallel systems and to
formally argue that these criteria are fulfilled.

We  show  that  combining  fulfills  two  important  criteria:  (1)  Combining  is  a  general 
technique  that  applies  to  arbitrary  memory  access  operations,  not  just  an  ad  hoc  method  to 
handle  the  NYU  Ultracomputer  operations.  (2)  This  new  interconnect  mechanism  does  not 
change  the  properties  of  the  processor-memory  system. 

In  this  paper  we  address  these  issues  rigorously.  A  new,  very  general  formalism  for 
read-modify-write  (RMW)  operations  is  given.  A  general  definition  is  given  of  a  correct 
machine  implementation.  A  method  for  combining  general  RMW  operations  is  given  and 
proven  to  be  correct.    Several  families  of  memory  access  operations  are  analyzed  using  this 

Ultracomputer  Note  105  Page  2 


general framework. This includes familiar operations such as load, store, swap, test-and-set,
fetch-and-add, and general data-level synchronization primitives (see [GP]). It is well
known that any associative operation can be combined efficiently [GK]. We show that
other combinable families of operations include the four standard arithmetic operations, all
sixteen boolean functions, and synchronization methods such as full/empty bits. Implementation
issues concerning support of such primitives are considered. Finally, the combining
mechanism is shown to be closely related to the parallel prefix computation problem [LaF].

2.  Read-Modify-Write

We  use  a  formalism  similar  to  that  developed  by  Lynch  and  Fisher  [LyF]:  A  parallel 
computation  consists  of  a  set  of  processes  that  execute  in  parallel.  Each  of  these  processes 
is  considered  to  be  a  sequential  program  augmented  with  the  ability  to  access  global, 
shared  variables.  We  restrict  our  attention  to  shared  memory  access  techniques  and 
assume  standard  operations  for  manipulation  of  local  (or  private)  data. 

Instead of the usual load and store memory access operations of sequential processing,
all accesses to shared variables are assumed to be Read-Modify-Write (RMW) operations.
The operation (or instruction) RMW(X,f), where X is a shared variable and f is a mapping,
is defined to be equivalent to the indivisible execution of the following function:

function RMW(X,f)
begin
    temp ← X;
    X ← f(X);
    return(temp)
end

This operation yields, as its value, the old value of the variable X and also updates the
value stored in X according to the updating transformation f.
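As a minimal sketch of the definition above, the following Python class models a shared variable whose only access path is RMW. The class name `SharedVar` and the use of a software lock are assumptions made for illustration; the lock merely stands in for the indivisibility that the text requires of the operation.

```python
import threading


class SharedVar:
    """Hypothetical model of a shared variable accessed only through RMW."""

    def __init__(self, value):
        self._value = value
        # Assumption: a lock models the indivisible execution of RMW.
        self._lock = threading.Lock()

    def rmw(self, f):
        """Return the old value and replace it by f(old value), indivisibly."""
        with self._lock:
            temp = self._value
            self._value = f(temp)
            return temp


x = SharedVar(10)
old = x.rmw(lambda v: v * 2)      # returns the old value, stores f(old)
current = x.rmw(lambda v: v)      # identity mapping: reads without changing
```

Here `old` holds the value before the update and `current` the value after it, which matches the "return the old value, then update" order of the function above.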

The usual load and store operations are particular cases of RMW operations: a load
from (the address of) variable X is equivalent to RMW(X,id), where id is the identity
mapping (i.e. f(x) = x). A store of value v to variable X is equivalent to RMW(X,I_v),
where I_v is the mapping that has constant value v (i.e. f(x) = v); the returned value is
ignored. In fact, an assignment of the form Y ← RMW(X,I_Y), where Y is a private variable
and X is a shared variable, implements a swap instruction: X and Y swap values.
Note that the usual use of swap instructions is to exchange values between a shared
variable (the lock) and a private variable (the key) (see, e.g. [PET], §9.5.4).

The well known test-and-set instruction can also be implemented as an RMW instruction.
We have

test-and-set(X)    =   RMW(X,I_true) .

A more powerful RMW operation is the fetch-and-add synchronization primitive. It is
defined by

fetch-and-add(X,a)    =    RMW(X,+a) ,

where +a is Curried addition, i.e. +a(x) = x + a. It corresponds to the indivisible execution
of the following code.

function fetch-and-add(X,a)
begin
    temp ← X;
    X ← X + a;
    return(temp)
end
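The following Python sketch illustrates one common use of fetch-and-add: handing out distinct tickets to concurrent processes. The `Counter` class, the `run` helper, and the thread counts are assumptions chosen for the example, and a software lock stands in for the indivisible execution described above.

```python
import threading


class Counter:
    """Hypothetical shared counter supporting fetch-and-add."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()   # assumption: models indivisibility

    def fetch_and_add(self, a):
        with self._lock:
            temp = self._value
            self._value = temp + a
            return temp


def run(n_threads=8, per_thread=100):
    """Each thread draws tickets; return all tickets (sorted) and the final count."""
    counter = Counter()
    tickets = []
    tickets_lock = threading.Lock()

    def worker():
        mine = [counter.fetch_and_add(1) for _ in range(per_thread)]
        with tickets_lock:
            tickets.extend(mine)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(tickets), counter.fetch_and_add(0)


tickets, final = run()
```

Because each call returns the old value and the update is indivisible, no two callers receive the same ticket, and the tickets form a contiguous range.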

A similar operation (replace-add) was introduced many years ago [DGSS]. It was
independently considered by Dijkstra [Di] who rejected it, believing it to be an inadequate
tool for synchronization. It nevertheless turned out to be a very useful synchronization
primitive, and was essential in the development of efficient coordination code for the NYU
Ultracomputer operating system [Ru],[GLR]. The change from replace-add to fetch-and-add
[GK] simplified the combining logic and paved the way to the general result given in
this paper.

Any memory access that consists of reading one shared memory location, performing
an arbitrary local computation, then updating the memory location can be expressed as an
RMW operation of the above form. This is the general form for memory accesses
assumed by [LyF], and seems to encompass most, if not all, useful synchronization operations
based on shared variables. Other examples of RMW operations will be presented in
later sections.

3.   Semantics

In their classic paper describing the IBM 360 system, Amdahl, Blaauw, and Brooks
[ABB] introduced the notions of architecture, implementation, and realization. The architecture
can be thought of as the abstract machine that is presented to the user at the assembly
language level or presented in the principles of operations manual. The implementation
is how hardware is used to implement the features and operations of the architecture.
The realization is the exact specification of the hardware, such as which chips are used and
how they are wired together. In an implementation, each "atomic" operation of the architecture
may actually consist of several "subatomic" microoperations; the implementation
may use stores¹ that are not visible to the user. The implementation is correct if its visible
behavior is a correct behavior of the architecture: the initial to final state mapping on visible
stores is the same for the architecture as for the implementation. A similar situation
holds for the realization. These definitions can be extended and generalized to all the levels
of an architecture, software and hardware. At each level an architecture is implemented
by a lower one; the implementation is correct if it yields the same visible behavior.

3.1.   Definitions 

We use a formalism similar to that developed by Lamport [La1], [La3]. The state of a
machine  is  represented  by  the  values  of  its  stores.  There  are  stable  stores,  such  as 
memory,  registers,  status  flags,  etc.,  and  transient  stores,  such  as  messages.  Stable  stores 
support  nondestructive  read  and  write  operations.  Messages  are  created  by  message 
transmission  operations,  and  destroyed  by  message  reception  operations.  They  are  used 
for  internal  communication  and  communication  with  the  external  world  (I/O  messages). 

The  execution  of  the  computer  can  be  viewed  as  consisting  of  a  set  of  atomic  events. 
Each  atomic  event  may  modify  the  value  of  one  or  more  stores,  and  create  or  receive  one 
or  more  messages.  The  semantics  of  an  atomic  event  is  defined  by  a  mapping  that  specifies 
the  state  transformation  associated  with  it:  messages  consumed,  messages  created  and  their 
values,  and  new  values  of  modified  stores.  This  naturally  extends  to  a  definition  of  the 
semantics  of  a  sequence  of  atomic  events  by  composition  of  mappings:  in  a  sequence, 
event  i+1  produces  a  new  state  based  on  the  state  produced  by  event  i. 
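A minimal Python sketch of this composition of mappings follows. It assumes a state represented as a dictionary from store names to values and omits messages entirely; the event constructors `write` and `add_into` are hypothetical examples, not operations defined in the text.

```python
from functools import reduce


def write(store, value):
    """Atomic event that sets one store to a constant value."""
    def event(state):
        new_state = dict(state)
        new_state[store] = value
        return new_state
    return event


def add_into(dst, src):
    """Atomic event that reads two stores and updates one of them."""
    def event(state):
        new_state = dict(state)
        new_state[dst] = state[dst] + state[src]
        return new_state
    return event


def run_sequence(events, initial_state):
    """Semantics of a sequence: event i+1 is applied to the state produced by event i."""
    return reduce(lambda state, event: event(state), events, initial_state)


initial = {"x": 1, "r1": 0}
final = run_sequence([write("r1", 5), add_into("x", "r1")], initial)
```

Each event is a pure mapping from state to state, so the meaning of the sequence is exactly the composition of the individual mappings, applied left to right.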


¹ In this section and the next, the term store will denote the state information; the term write will denote
the store operation.

We assume that one can observe the initial contents of the stores, the final contents of
the stores, and the order of I/O events (input reception and output transmission) as well as
their values. That is, the observable behavior of a system consists of (i) the initial state to
final state mapping induced by the computation and (ii) the sequence of I/O events occurring
during the computation. Since we can observe the time (or order) of each external
communication event, we can consider them to be totally ordered.

Many atomic events may occur concurrently; the order of occurrence of two events is
significant only if their execution order affects the observable behavior of the system. This
motivates the following definitions. Two sequences of events are equivalent if, for any initial
value of the stores and any sequence of input messages, the execution of these two
sequences yields the same final values of the visible stores and the same sequence of I/O
events. A system execution is a set of events partially ordered by a relation → such that any
two extensions of → to total orders yield equivalent sequences of events. We say that event
α precedes event β if α → β. Our definition implies that the execution order → captures all
dependencies that exist between atomic events.

The definition of a system execution (usually) implies that the relation → has the following
properties:

(1) If u and v access the same store, and one of the accesses is a write access, then either
u → v or v → u. (An event "reads" the stores that are in the domain of the mapping associated
with it, and "writes" the stores that are in the range of this mapping.)

(2) If u and v are external communication events and u occurs before v then u → v
(remember that external communication events are totally ordered).

(3) If u creates a message and v receives that message then u → v.

The architecture of a computer is understood in terms of operations; each operation
may consist of several atomic events. The partial order relation → on atomic events induces
a relation, denoted by ⊏, on operations. Operation u precedes operation v, i.e. u ⊏ v, if
some event of u precedes some event of v.

Correctness criteria are expressed in terms of constraints on possible system executions;
a system is defined by the set of legal system executions. For example, if operations
are required to be atomic then the execution relation ⊏ induced by → must be a partial
order: A cycle u ⊏ ⋯ ⊏ u implies that some event, say β, of an operation, say v, can be
seen to occur after some event, say α₁, belonging to u, i.e. (α₁ → β), and before some
other event, say α₂, belonging to u, i.e. (β → α₂); this implies u is not indivisible. Conversely,
if the execution relation induces a partial order on operations then it can be
extended to a total order so that events belonging to the same operation are contiguous:
The outcome of the execution is as if the operations were executed serially, with each
operation terminating before the next one starts.
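The atomicity criterion just described can be checked mechanically on small examples. The Python sketch below is illustrative only: it assumes the event order is given as a set of (earlier, later) pairs and that `op_of` maps each event to its operation; the function names are hypothetical. It builds the relation induced on operations and reports whether that relation contains a cycle.

```python
def induced_relation(event_order, op_of):
    """Pairs (u, v) of distinct operations such that some event of u precedes some event of v."""
    return {(op_of[a], op_of[b]) for a, b in event_order if op_of[a] != op_of[b]}


def has_cycle(pairs):
    """Depth-first search for a cycle in the directed graph given by pairs."""
    graph = {}
    for u, v in pairs:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set())
    IN_PROGRESS, DONE = 1, 2
    status = {}

    def visit(node):
        status[node] = IN_PROGRESS
        for succ in graph[node]:
            if status.get(succ) == IN_PROGRESS:
                return True
            if succ not in status and visit(succ):
                return True
        status[node] = DONE
        return False

    return any(node not in status and visit(node) for node in graph)


# Operation u has events a1 and a2; operation v has the single event b.
op_of = {"a1": "u", "a2": "u", "b": "v"}

interleaved = {("a1", "b"), ("b", "a2")}   # b falls between the two events of u
serial = {("a1", "a2"), ("a2", "b")}       # all of u, then v

interleaved_is_atomic = not has_cycle(induced_relation(interleaved, op_of))
serial_is_atomic = not has_cycle(induced_relation(serial, op_of))
```

In the interleaved case the induced relation contains both (u, v) and (v, u), so u cannot be regarded as indivisible; in the serial case the induced relation is acyclic.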

3.2.   Composing  Systems  from  Subsystems 

It is often convenient to define a system as a composition of subsystems. The stores
of the system are the stores of the subsystems and the events of the system are events of
the subsystems. We assume that subsystems communicate only by messages: an event of a
subsystem may modify only stores of that subsystem, but it may create a message that is
later consumed by another subsystem. (This is similar to the work pioneered by Milner
and Milne [MM] in the context of synchronous communicating processes.)

The semantic specification of the global system is derived from the semantic specifications
of the composing subsystems. Each event is associated with the corresponding mapping
in its subsystem. The set of legal system executions is defined as follows:

Let → be a partial order on events of the system. This order induces an ordering of the
events within each subsystem. This partial order is not necessarily an execution order: the
relation → may not define an order on communication events that are external to a subsystem
but internal to the global system. When a subsystem is considered in isolation, the
order in which it executes external communication events is deemed meaningful; when it is
part of a bigger system the order of its communication with other subsystems may not be
meaningful, i.e. does not necessarily affect the global behavior of the system.

The relation → defines a correct system execution if it can be extended (by ordering all
communication events within each subsystem) to a relation ⇒ such that the restriction of
⇒ to each subsystem is a correct execution of the subsystem. Informally, a global system
execution is correct if each subsystem may view it as a correct local system execution,
where these different views are consistent.

3.3.  An Example: The Uniprocessor

For  example,  we  can  consider  a  serial  computer  to  consist  of  two  separate  subsystems: 
processor  and  memory.  The  processor  executes  a  stream  of  instructions.  The  memory 
accepts  a  stream  of  requests  (read,  write,  read-modify-write,  etc.).  Each  request  may 
modify  the  memory  content  and  return  a  value. 

Assume we have a formal definition for a correct execution by a serial processor.
Informally, such a definition associates with each instruction a sequence of memory
accesses and a mapping that computes the next state of the processor, given the current
state and the values returned from memory. It specifies that instructions are executed
atomically, so that the outcome of the execution of a stream of instructions can be computed
by composing the mapping associated with each consecutive instruction. An execution
totally orders successive instructions executed by a processor.

Similarly,  we  assume  the  existence  of  a  formal  definition  for  a  correct  execution  by 
memory.   A  memory  operation  consists  of  three  events: 

(1)  receives  a  memory  request  message; 

(2)  processes  the  request,  possibly  modifying  the  memory  content;  and 

(3)  sends  a  reply  message  (we  assume  that  all  accesses  generate  replies;  the  reply  is  an 
acknowledgment  for  accesses  that  do  not  return  values). 

We have M.receive.request → M.process.request → M.send.reply; memory operations are
executed atomically.

A correct execution of the system must respect data dependencies: if instruction u precedes
instruction v, and both access the same memory location, then the access on behalf of
u must occur before the access on behalf of v. This correctness condition does not occur
explicitly in our definitions; it pertains neither to the processor nor to the memory, but to
their interaction. We shall show that it implicitly follows from the correctness requirements
of the subsystems.

Let → be a partial order defined by a correct execution of the system consisting of processor
and memory, and let ⊏ be the relation induced on instructions. Let u and v be two
processor instructions that access the same memory location such that one of the instructions
is a write and v follows u. We have u ⊏ v. Assume, by contradiction, that the
memory access on behalf of v is executed before the memory access on behalf of u, i.e.

M.process.request.v → M.process.request.u .

Since, by the ordering of events of an operation, we know that

P.send.request.v → M.receive.request.v → M.process.request.v

and

M.process.request.u → M.send.reply.u → P.receive.reply.u ,

we get the ordering:

P.send.request.v → M.receive.request.v → M.process.request.v
→ M.process.request.u → M.send.reply.u → P.receive.reply.u

so that v ⊏ u. Together with u ⊏ v this forms a cycle, and → cannot be extended to a
relation that induces a partial order on processor instructions.

A correct implementation of the processor/memory system must ensure that memory
accesses are executed in an order consistent with the order instructions are issued, whenever
there is a memory access conflict. Thus, the outcome of a (correct) execution is as if
the instructions were executed serially.

3.4.   Multiprocessors 

We wish to extend these definitions to a shared memory multiprocessor. Such a
machine consists of several processors and several shared memory modules. Each processor
and memory module is defined as in the previous example. We assume the existence
of a formal definition for the correct execution for a processor, and of correct execution
for a memory module.

The correctness of the entire system is derived as previously: an execution relation →
is correct if it can be extended to a relation, ⇒, that correctly orders events at each processor
and at each memory module. If the execution is correct then the atomic events can
be serially ordered so that events pertaining to the same processor instruction are contiguous.
The outcome of the execution is as if the instructions were executed serially, with all
events of one instruction terminating before any event of the next instruction starts, so
that for each processor the subsequence of events of this processor is a valid execution for
the processor. This is the "sequential consistency principle" stated by Lamport [La1]. It
implies that we can view a multiprocessor as a system of sequential processes communicating
via shared variables, where each instruction is an atomic operation [LyF]; access to
(shared) memory is perceived to occur simultaneously with the execution by the processor
of the instruction that generates the access.

3.5.   Asynchronous  Memory  Access 

The sequential consistency principle can be enforced in hardware either by using a
central controller for memory accesses [La1] or by requiring each processor to wait for an
acknowledgment after each shared memory access (before beginning to process the next
shared memory access). Both choices, however, severely limit the performance of a large
scale  parallel  processor.  A  central  controller  becomes  a  serial  bottleneck  when  there  are  a 
large  number  of  processors.  The  network  latency  time  is  long  (as  compared  to  the  basic 
instruction  cycle  time  of  each  processor)  in  a  shared  memory  machine  with  a  large  number 
of  processors  and/or  memory  modules.  This  latency  time  overhead  can  be  mitigated  by 
allowing  the  processor  to  continue  processing  before  receiving  an  acknowledgment.  For 
example,  the  NYU  Ultracomputer  and  RP3  hardware  allow  the  pipelining  of  shared 
memory  accesses  from  the  processors. 

These  machines  present  the  user  with  a  shared  memory  multiprocessor  architecture 
with  the  following  types  of  atomic  events: 

(1)  Execution  of  a  local  instruction,  i.e.  instructions  that  involve  only  local  stores;  and 

(2) Execution of events comprising a shared memory access operation (an RMW operation).
We assume that each such operation involves only one shared memory module
and consists of three atomic events:

SEND - a request message is issued by the processor.

ACCESS - the request message is consumed by a memory module, the request is
executed, and a reply message is generated.

RECEIVE - the reply message is consumed by the processor.

The three components of the same shared memory access operation are ordered

SEND → ACCESS → RECEIVE .

The control logic of each processor may impose constraints on the sequencing of the
events executed by the processor. However, it does not necessarily wait for a reply
from a shared memory access before proceeding with another event.

We call a machine with such an architecture a Multiprocessor with Asynchronous
Shared Memory or MASM. A MASM architecture does not necessarily fulfill the sequential
consistency principle, i.e. is not "correct" according to the usual definitions; however, it
can implement a sequentially consistent multiprocessor. The sequential consistency principle
is enforced by a software solution, involving compile time analysis of the global code,
that specifies constraints on the pipelining. These constraints are enforced by the control
logic of each processor. For example, the NYU and RP3 software distinguishes between
"private" variables, "shared" read-only variables and "shared" read/write variables (all of
which can be stored in shared memory), and prohibits the pipelining of accesses to variables
of the last type. Shasha and Snir [SS] propose a more elaborate analysis based on
compile time detection of data dependencies; this analysis is used to define "delay" pairs,
i.e. pairs of memory accesses at the same processor such that the first access must complete
before the second starts.
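To illustrate why unconstrained pipelining can violate sequential consistency, and how delay pairs restore it, the Python sketch below enumerates every order in which the ACCESS events of a small two-processor program can reach memory. The program (each processor stores to one variable and then loads the other) is a standard textbook example chosen for this sketch, not one taken from the analysis in [SS]; the names `PROGRAM` and `outcomes` are hypothetical, and X and Y are assumed to reside in different memory modules.

```python
from itertools import permutations

# Each access is (processor, kind, variable); list order is program order.
PROGRAM = {
    "P1": [("P1", "store", "X"), ("P1", "load", "Y")],
    "P2": [("P2", "store", "Y"), ("P2", "load", "X")],
}


def outcomes(delay_pairs):
    """Set of (value loaded by P1, value loaded by P2) over all ACCESS orders
    in which the first access of every delay pair occurs before the second."""
    accesses = [a for prog in PROGRAM.values() for a in prog]
    results = set()
    for order in permutations(accesses):
        pos = {a: i for i, a in enumerate(order)}
        if any(pos[first] > pos[second] for first, second in delay_pairs):
            continue                      # order violates a delay pair
        memory = {"X": 0, "Y": 0}
        loaded = {}
        for proc, kind, var in order:
            if kind == "store":
                memory[var] = 1
            else:
                loaded[proc] = memory[var]
        results.add((loaded["P1"], loaded["P2"]))
    return results


pipelined = outcomes(delay_pairs=[])
delayed = outcomes(delay_pairs=[tuple(PROGRAM["P1"]), tuple(PROGRAM["P2"])])
```

With no delay pairs, the outcome in which both processors load 0 is reachable, although no serial interleaving of the four instructions produces it. Declaring each processor's store and load a delay pair removes that outcome.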

The last definition did not mention the communication medium between processors
and memories. We assume that this interconnection network is "invisible"; its state is not
observed by the user. Note, too, that we assume asynchronous processor to memory communication;
it is only the relative order of events that affects the result, not the absolute
time of their execution. Our formalism does not encompass synchronous communications,
or time-out mechanisms.

4.   Combining  Mechanism 

There have been many proposals for the architecture of parallel processors. The
main issue is how to interconnect the processors so that they may communicate efficiently.
While shared bus type architectures are well suited for interconnecting dozens of processors
and memory modules, multistaged interconnection networks appear to be required for
larger scaled parallel machines. We first describe our assumptions concerning the interconnection
network and then give a general technique for "combining" common shared
memory requests. We show that this implementation is correct in the sense described in
the previous section.

4.1.  Processor to Memory Connection

We  assume  a  MASM  architecture  as  defined  in  §3.5,  and  for  the  sake  of  definiteness, 
make  the  following  additional  assumptions: 

(1) The processors communicate with shared memory modules via a multistaged interconnection
network. The network is packet switched. It may be either multistage or
recirculating.

(2) A reply message is sent back on the same path followed by the request message. This
condition is trivially satisfied for multistage networks that have a unique path connecting
each processor to each memory module. It is easy to enforce the condition in any
network: A message can construct as it travels through the network a header describing
its path; this header is used to route the reply in the reverse direction [GGK].

These   assumptions   are  also  made  in  the  NYU   Ultracomputer   [GGK]   and  IBM's  RP3 
machine  [PBH]. 
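The path-header idea in assumption (2) can be sketched as follows. This is a toy model under stated assumptions: a hop is represented as a hypothetical (switch name, input port) pair, and the functions `forward_request` and `return_route` are illustrative names, not part of either machine's design.

```python
def forward_request(message, hops):
    """Carry a request through the network; each switch appends its hop to the header."""
    header = []
    for hop in hops:
        header.append(hop)
    return dict(message, path=header)


def return_route(request_message):
    """Hops visited by the reply: the recorded request path in reverse order."""
    return list(reversed(request_message["path"]))


request = forward_request({"id": 7, "addr": 100}, [("s0", 1), ("s1", 0), ("s2", 1)])
route_back = return_route(request)
```

Since the reply retraces the request's hops in reverse, it passes through every switch that handled the request, which is the property the combining scheme relies on.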

4.2.  How  to  Combine  Requests 

We assume that memory accesses are RMW operations. A memory request message
has the form <id,addr,f>, where id is an identifier that uniquely identifies the request,
addr is a reference (address) to a memory location,² and f is (the encoding of) a mapping.
When this message reaches memory, @addr, the contents of location addr, is replaced by
f(@addr), and a message <id,@addr> containing the original contents of location addr is
returned.

Suppose that two request messages of the form <id₁,addr,f> and <id₂,addr,g> meet
at the same switch. These two messages have the same destination and thus conflict. We
propose combining these two messages into a single message. This is done as follows:

(1) The switch stores id₁, id₂, and f and forwards the message <id₁,addr,f∘g>, where f∘g
is (an encoding of) the composition³ of f and g.

(2) When a reply message <id₁,val> to this composed request reaches the switch, the
stored information is retrieved by matching the id's; a message <id₁,val> is
forwarded as a reply to the first request <id₁,addr,f>, and a message <id₂,f(val)> is
forwarded as a reply to the second request <id₂,addr,g>.

² The address may be part of the identifier. Thus, if each processor has at most one outstanding request to
each address, then the processor number can be used as identifier.

Assume that the combined request <id_1,addr,f∘g> is not further combined in the
network. Then <id_1,@addr> is returned as a reply, and the value @addr is replaced in
memory by g(f(@addr)). At the switch the reply <id_1,@addr> is forwarded (back) to the
first request, and the reply <id_2,f(@addr)> is forwarded (back) to the second request.
This is illustrated in Figure 1. The final effect is as if the first request was executed,
returning the value @addr and replacing it in memory with f(@addr), and then the second
request was executed, returning the value f(@addr) and replacing it with g(f(@addr)).
Combining is transparent: the operations executed by the processors and the final memory
content are the same as would occur without combining.
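The two steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: mappings are plain Python functions rather than encodings, and the names `Request`, `Switch`, `combine`, `decombine`, and `memory_access` are hypothetical and do not come from any actual switch design.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Request:
    ident: str                    # unique request identifier
    addr: int                     # memory location referenced
    f: Callable[[int], int]       # the mapping (here a function, not an encoding)

class Switch:
    def __init__(self):
        # id of the forwarded request -> (id of the absorbed request, stored mapping f)
        self.wait = {}

    def combine(self, first, second):
        """Step (1): store id_1, id_2 and f; forward <id_1, addr, f o g>."""
        assert first.addr == second.addr
        f, g = first.f, second.f
        self.wait[first.ident] = (second.ident, f)
        return Request(first.ident, first.addr, lambda x: g(f(x)))

    def decombine(self, ident, val):
        """Step (2): turn the reply <id_1, val> into <id_1, val> and <id_2, f(val)>."""
        second_id, f = self.wait.pop(ident)
        return [(ident, val), (second_id, f(val))]

def memory_access(memory, req):
    """Execute an RMW at memory: return the old value, store f(old)."""
    old = memory[req.addr]
    memory[req.addr] = req.f(old)
    return req.ident, old

memory = {0: 10}
switch = Switch()
combined = switch.combine(Request("p1", 0, lambda x: x + 3),
                          Request("p2", 0, lambda x: x * 2))
print(switch.decombine(*memory_access(memory, combined)), memory[0])
```

With the initial content 10, the first request receives 10, the second receives 13, and memory ends at 26, exactly as if the two requests had been executed one after the other.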

4.3.   Correctness  of  Combining  Mechanism 

We  now  show  that  this  implementation  is  correct:  The  observable  behavior  in  a  com- 
putation of  a  combining  machine  is  a  behavior  that  could  be  observed  in  a  computation  of  a 
noncombining  machine.  Note  that  the  reverse  is  not  necessarily  true:  There  are  sequences 
of  observable  events  that  occur  in  a  noncombining  machine  but  can  not  occur  in  a  combin- 
ing one.    (We  follow  what  Lamport  calls  the  "restrictive"  approach  to  specification  [La3].) 

Our  implementation  does  not  change  the  set  of  operations  executed  by  the  processors; 
it  is  transparent  to  the  processor  logic.  It  may  reduce  the  number  of  ACCESS  operations 
that  are  executed;  however,  the  memory  state  that  occurs  after  the  execution  of  an  ACCESS 
operation  in  the  combining  machine  could  occur  in  some  valid  computation  of  the  noncom- 
bining machine  (after  the  execution  of  some  sequence  of  ACCESS  operations).  In  other 
words,  for  each  sequence  of  operations  in  a  combining  machine  there  exists  a  sequence  of 
operations  in  a  noncombining  machine  that  is  equivalent  in  the  following  sense: 

(1)  The  same  operations,  in  the  same  order,  are  executed  by  the  processors  in  either 
machine. 

(2)  The  value  of  each  RECEIVE  message  is  the  same  in  both  machines. 

(3)  The  final  value  of  each  shared  memory  location  is  the  same  in  both  machines. 


Footnote: We use f∘g(x) to denote g(f(x)).

Note that this does not imply that the combining machine satisfies the sequential
consistency principle. It only implies that the combining machine is a correct implementation
of  a  MASM  architecture.  Any  mechanism  that  can  be  used  by  the  processors  of  a  noncom- 
bining  machine  to  enforce  sequential  consistency  will  achieve  the  same  goal  on  a  combining 
machine. 

In  general,  a  combined  request  can  be  further  combined.  An  inductive  proof  is 
needed  to  show  that  the  final  outcome  is  correct.  In  a  noncombining  network  each  SEND  is 
associated  with  one  ACCESS  and  one  RECEIVE;  in  a  combining  network  each  SEND  is  associ- 
ated with  one  RECEIVE,  but  several  SEND  operations  may  result  in  one  (combined)  ACCESS. 

Each memory request message in the network is associated with a sequence of
memory request messages issued by processors. A memory request issued by a processor
represents itself; if memory request A was obtained by combining B with C, where B
represents requests b_1, ..., b_i and C represents requests c_1, ..., c_j, then we say A
represents requests b_1, ..., b_i, c_1, ..., c_j.

Lemma: Consider a combining machine as in §4.2. Let A = <id,addr,f> be a memory
request message, representing requests a_1 = <id_1,addr,f_1>, ..., a_n = <id_n,addr,f_n>.
Let a'_i be the reply message associated with a_i, i.e. the reply message <id_i,val> received
by the processor that issued a_i. Then

(1) f = f_1∘ ... ∘f_n;

(2) The values returned by all of the a'_i are the same as would be returned if the memory
accesses associated with requests a_1, ..., a_n in a noncombining network were
executed consecutively.

(3) If request A reaches memory without being combined, the value stored at location
addr after execution of request A is the same as the final value stored at location addr
after consecutively executing the memory accesses associated with a_1, ..., a_n in a
noncombining network.

Proof: The lemma is proven by induction on the number of requests represented by a
memory access message. It is trivial for a message that represents one request. Next,
assume that the lemma is true for messages representing fewer than n requests, and assume
that A is obtained by combining B and C, where B represents r requests and C represents
n-r requests (1 ≤ r < n). Let B = <id',addr,g> and C = <id'',addr,h>, so that
A = <id',addr,g∘h>. Then message A generates a reply <id',val>, which will also be
the reply to request B; request C generates the reply <id'',g(val)>. If A reaches memory
then val = @addr and the new value in memory is h(g(@addr)).

Let b_1 = <id'_1,addr,g_1>, ..., b_r = <id'_r,addr,g_r> be the sequence of requests B
represents; similarly, let c_1 = <id''_1,addr,h_1>, ..., c_{n-r} = <id''_{n-r},addr,h_{n-r}> be the
sequence of requests C represents. Let b'_i and c'_i be the reply messages associated with
the respective requests. By the inductive assertion, g = g_1∘ ... ∘g_r and
h = h_1∘ ... ∘h_{n-r}; the messages b'_i return the values val, g_1(val), ...,
g_{r-1}( ... (g_1(val)) ... ); the messages c'_i return the values g(val), h_1(g(val)), ...,
h_{n-r-1}( ... (h_1(g(val))) ... ). It follows that the values returned, and the new memory
value when A reaches memory, are as if the memory accesses associated with
b_1, ..., b_r, c_1, ..., c_{n-r} in a noncombining network were successively executed in this
order. This proves the lemma. □
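The inductive structure of the lemma can be exercised on a small example. The following Python sketch is illustrative only: a combined message is modeled as a nested dictionary, and the helper names `leaf`, `combine`, `replies`, and `serial` are hypothetical. It checks that the replies produced by recursive decombining match a serial execution of the represented requests.

```python
def leaf(ident, f):
    # A request issued by a processor represents itself.
    return {"ident": ident, "f": f, "parts": None}

def combine(B, C):
    # A = <id', addr, g o h>: apply g (from B) first, then h (from C).
    g, h = B["f"], C["f"]
    return {"ident": B["ident"], "f": lambda x: h(g(x)), "parts": (B, C)}

def replies(msg, val):
    # Decombine: B receives val, C receives g(val); recurse into both.
    if msg["parts"] is None:
        return [(msg["ident"], val)]
    B, C = msg["parts"]
    return replies(B, val) + replies(C, B["f"](val))

def serial(requests, x):
    # Reference behavior: execute the requests one after the other.
    out = []
    for ident, f in requests:
        out.append((ident, x))
        x = f(x)
    return out, x

reqs = [("a", lambda x: x + 1), ("b", lambda x: x * 3),
        ("c", lambda x: x - 4), ("d", lambda x: x * x)]
a, b, c, d = (leaf(i, f) for i, f in reqs)
A = combine(combine(a, b), combine(c, d))
print(replies(A, 5), A["f"](5))
```

Starting from the value 5, both the combined and the serial execution return 5, 6, 18, 14 to the four requests and leave 196 in memory.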

Theorem:  The  implementation  of  shared  memory  access  by  a  combining  network  is 
correct. 

Proof: The previous lemma clearly implies the theorem: Indeed, let a_1, ..., a_m be a
serialization of the events in an execution of a machine with a combining network. Replace
each ACCESS event a_i by the sequence of ACCESS events associated (in a machine with
noncombining network) with the requests represented by the message that generated a_i. Then
all events occurring at processors appear in the same order in both sequences; the RECEIVE
messages return the same values; and the shared memory state after the execution of an
ACCESS event in the first sequence is identical with the memory state after execution of the
corresponding sequence of ACCESS events in the second sequence. □

5.   Applications 

Suppose one intends to combine RMW requests with mappings from some family Φ of
transformations. Composition can generate any mapping in the semigroup spanned by
Φ. We need to have an encoding for the mappings in this semigroup so that

(1) the computer representations of the mappings have reasonable size;

(2) the encoding of f∘g can be easily computed from the encoding of f and the encoding of
g; and

(3) f(a) can be easily computed from the computer representations of f and a.

We shall not give formal definitions of "reasonable" or "easily computed", as these are
application dependent; we have in mind encodings that use a small constant number of
words, and computations that require few machine cycles. We say that the semigroup is
tractable if it fulfills these conditions.

5.1.   Loads  and  Stores 

Recall that a load from variable X is equivalent to RMW(X,id), where id is the
identity mapping, and a store (actually a swap) of value v to variable X is equivalent to
RMW(X,I_v), where I_v is the mapping that has constant value v. The set of mappings
{I_v} ∪ {id} is a semigroup, and composition is easily computed. A mapping from this
semigroup is represented by one computer word and one opcode bit. The composition yields
the expected results:

•  A  load  followed  by  a  load  combine  into  a  load. 

•  A  load  followed  by  a  store  combine  into  a  store  (the  value  fetched  is  returned  to  the 
load). 

•  A  store  followed  by  a  load  combine  into  a  store  (the  value  being  stored  is  returned  to 
the  load). 

•  A  store  followed  by  a  store  combine  into  a  store  of  the  second  value. 
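The four cases above follow from a one-line composition rule on the "one word plus one opcode bit" encoding. A minimal Python sketch, assuming a hypothetical `Op` record for that encoding:

```python
from typing import NamedTuple

class Op(NamedTuple):
    is_store: bool   # the opcode bit: False = load (id), True = store (I_v)
    value: int = 0   # the word v; ignored for a load

LOAD = Op(False)

def store(v):
    return Op(True, v)

def apply(op, x):
    """Value left in memory when op is applied to content x."""
    return op.value if op.is_store else x

def compose(first, second):
    """Encoding of first o second, i.e. apply first, then second."""
    return second if second.is_store else first

print(compose(LOAD, store(7)), compose(store(7), LOAD), compose(store(7), store(9)))
```

The rule says that a later store always wins, and a later load leaves the earlier operation unchanged, which reproduces the four bullets.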

One  need  not  transmit  the  value  returned  by  a  store  request,  as  this  is  of  no  interest; 
an  acknowledgment  suffices.  A  combined  request  needs  to  return  a  value  only  if  the  first 
atomic  request  in  it  is  a  load  operation.  One  can  avoid  returning  values  in  the  other  cases 
by  tagging  these  instructions.  Then,  with  the  possible  exception  of  these  extra  tag  bits, 
combining  never  generates  extra  traffic;  often  it  will  decrease  it  significantly. 


Footnote: A semigroup is a set closed under an associative operation, which in this case is
map composition.

Note that, in general, the order of combined requests is arbitrary and can be reversed.
This  can  be  used  to  further  simplify  combining.  For  example,  if  the  network  always 
chooses  to  effect  a  store  before  a  load  whenever  two  such  requests  are  combined,  then  a 
store  never  needs  to  return  a  value. 

The situation of a store combined with a load suggests a slight improvement in
performance by satisfying the load immediately. That is, the store would be forwarded to the
memory module and its value would also be returned, as soon as possible, back to the
processors that issued the load. A computation on such a machine is still equivalent to a
computation on a machine with a noncombining network, where local operations SEND, ACCESS, and
RECEIVE are atomic events. However, it is no longer true that ACCESS → RECEIVE (i.e. that
each ACCESS precedes the corresponding RECEIVE); a processor may get a reply to a load
request long before the value returned is actually stored in memory. This departure from
the MASM model may lead to incorrect behavior; in particular, constraints on the scheduling
of events at each processor cannot enforce sequential consistency.

If Φ is a semigroup of mappings, then Φ ∪ {I_v} ∪ {id} is a semigroup too. We have

$$f \circ id = id \circ f = f,$$

$$f \circ I_v = I_v, \quad \text{and}$$

$$I_v \circ f = I_{f(v)}.$$

Thus, if Φ is tractable, then Φ ∪ {I_v} ∪ {id} is tractable. In other words, it is always
possible to add load and store operations to a family of RMW operations, and combine them
all, without greatly increasing the complexity of the system.

Our  discussion  has  assumed  that  stores  and  loads  always  affect  an  entire  memory  cell 
(word  of  memory).  If  we  assume  a  word-addressable  machine,  say  with  four  byte  words, 
then  combination  of  store  operations  that  affect  only  bytes  or  half-words  will  require  intro- 
ducing store  operations  that  affect  any  subset  of  bytes  in  a  word.  At  a  higher  level,  if  one 
combines  atomic  stores  that  affect  components  of  a  structured  variable  then  one  needs  to 
support  stores  that  affect  an  arbitrary  subset  of  the  components  of  this  variable. 
5.2.  Associative  Operations

Let θ be an associative operation. Then fetch-and-θ(X,a) is equivalent to
RMW(X,θ_a), where θ_a(x) = x θ a. The function fetch-and-θ(X,a) corresponds to the
indivisible execution of the following code.

function fetch-and-θ(X,a)
begin
    temp ← X;
    X ← X θ a;
    return(temp)
end

A fetch-and-θ(X,a) followed by fetch-and-θ(X,b) can combine into fetch-and-θ(X, a θ b),
since

$$\begin{aligned}
\theta_a \circ \theta_b (x) &= \theta_b(\theta_a(x)) \\
&= (x \,\theta\, a) \,\theta\, b \\
&= x \,\theta\, (a \,\theta\, b) \quad \text{(since $\theta$ is associative)} \\
&= \theta_{a \,\theta\, b}(x).
\end{aligned}$$

Thus, the semigroup {θ_a} is tractable whenever θ is easy to compute.

Perhaps the most important fetch-and-θ primitive for large-scale shared memory
machines is the fetch-and-add, which was discussed earlier. The mapping can be
represented by one computer word (the addend). Two other potentially useful fetch-and-θ
primitives are fetch-and-OR, where OR is Boolean addition, and fetch-and-min.
Fetch-and-OR(X,1) is the test-and-set operation. Fetch-and-min is useful for allocation with
priorities.
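As an illustration of combining fetch-and-θ requests, the Python sketch below serves two requests with a single memory access and can be compared against serial execution. The function names are hypothetical, and θ is passed as an ordinary two-argument function assumed to be associative.

```python
import operator

def fetch_and(theta, memory, addr, a):
    """Indivisible fetch-and-theta(X, a): return old X, store X theta a."""
    temp = memory[addr]
    memory[addr] = theta(temp, a)
    return temp

def combined_fetch_and(theta, memory, addr, a, b):
    """Serve fetch-and-theta(X, a) followed by fetch-and-theta(X, b)
    with one access carrying the operand a theta b."""
    old = fetch_and(theta, memory, addr, theta(a, b))
    return old, theta(old, a)   # replies to the first and second request

for theta in (operator.add, operator.or_, min):
    serial_mem, combined_mem = {0: 5}, {0: 5}
    serial = (fetch_and(theta, serial_mem, 0, 3), fetch_and(theta, serial_mem, 0, 4))
    combined = combined_fetch_and(theta, combined_mem, 0, 3, 4)
    print(theta.__name__, serial == combined, serial_mem == combined_mem)
```

The same two functions cover fetch-and-add, fetch-and-OR, and fetch-and-min, since only associativity of θ is used.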

5.3.  Boolean  Operations 

The  sixteen  Boolean  operations  can  also  be  combined,  despite  the  fact  that  some  of 
them  are  not  even  associative  operations.  Moreover,  each  of  the  operations  can  be  applied 
to  bit  vectors,  of  one  word  size.   We  will  first  consider  the  unary  Boolean  operations. 

Let Φ be the set of four Boolean functions on one variable, 0, 1, x, and x̄ (the
complement of x). The associated RMW operations are test-and-clear, test-and-set, load,
and test-and-complement. The four functions in Φ can be represented by two bits, and can
be composed using the following 4x4 table.
Rows give the first mapping f, columns give the second mapping g; each entry is f∘g.

first \ second    load     clear    set      comp
load              load     clear    set      comp
clear             clear    clear    set      set
set               set      clear    set      clear
comp              comp     clear    set      load

The function compositions can be computed in hardware with few gates. Thus Φ is a
tractable semigroup.

As a result all sixteen operations fetch-and-θ, where θ is a binary Boolean operation,
can be combined. The reason is that the value of the second variable is fixed to a constant
(0 or 1) when a request is issued, and every Boolean operation on two variables with one
of the variables fixed is equivalent to some Boolean operation on one variable. For
example, fetch-and-AND(X,a) is a load when a = 1, and is a test-and-clear when a = 0.

This  result  can  be  extended  to  Boolean  operations  on  bit  vectors.  Mappings  on  bit 
vectors  of  length  n  are  represented  by  2n  bits.  Such  operations  are  useful  to  support  multi- 
ple locking. 
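One possible 2n-bit representation (an assumption made here for illustration; the text does not fix a particular encoding) uses a "keep" mask and a "flip" mask, so that a mapping acts as x → (x AND keep) XOR flip. Per bit, (keep, flip) = (1,0) is load, (0,0) is clear, (0,1) is set, and (1,1) is complement. The names `BitMap`, `compose`, and the three constructors are hypothetical.

```python
from typing import NamedTuple

WIDTH = 4
MASK = (1 << WIDTH) - 1

class BitMap(NamedTuple):
    keep: int   # bit i = 1: keep old bit i; 0: discard it
    flip: int   # bit i = 1: invert the (kept or cleared) bit i

    def __call__(self, x):
        return (x & self.keep) ^ self.flip

def compose(f, g):
    """Encoding of f o g (apply f, then g), computed with three gate-level operations."""
    return BitMap(f.keep & g.keep, (f.flip & g.keep) ^ g.flip)

def fetch_and_and(a):
    return BitMap(a & MASK, 0)

def fetch_and_or(a):
    return BitMap(~a & MASK, a & MASK)

def fetch_and_xor(a):
    return BitMap(MASK, a & MASK)

both = compose(fetch_and_or(0b0101), fetch_and_xor(0b0011))
print(bin(both(0b1000)))
```

Composition stays within the same 2n-bit representation, so any sequence of bitwise fetch-and-θ requests combines into a single request of constant size.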


5.4.   Arithmetic  Operations 

Let Ψ be the set of arithmetic operations addition, subtraction, multiplication, and
division. We also put into Ψ the reverses of the two noncommutative operations: reverse
subtraction of a and b is b − a, and reverse division of a and b is b/a. We wish to support
and combine all the operations of the form fetch-and-ψ, where ψ ∈ Ψ. In order to do that,
we need to support and combine the operations RMW(X,ψ_a), where ψ ∈ Ψ, and
ψ_a(X) = X ψ a. The semigroup spanned by the set of mappings {ψ_a : ψ ∈ Ψ} consists of the
Moebius functions: These are the functions of the form

$$\frac{ax + b}{cx + d}$$

where a, b, c, and d are constants, and either c ≠ 0 or d ≠ 0.

We represent the function

$$\frac{ax + b}{cx + d}$$

by the 2x2 matrix of coefficients

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}.$$

If f_A is the Moebius function represented by the matrix A, then, with the convention
f∘g(x) = g(f(x)),

$$f_A \circ f_B = f_{BA}.$$

Thus, a function is represented by four coefficients, and two functions are composed by
multiplying two 2x2 matrices.
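The matrix representation can be tried out with exact rational arithmetic. The sketch below is illustrative: the helper names and the operation labels ("add", "rsub", and so on) are hypothetical, and Python's `fractions.Fraction` is used so that the associativity issues of machine arithmetic do not arise.

```python
from fractions import Fraction

def matmul(P, Q):
    (a, b), (c, d) = P
    (e, f), (g, h) = Q
    return ((a * e + b * g, a * f + b * h), (c * e + d * g, c * f + d * h))

def apply(A, x):
    """Evaluate the Moebius function (ax + b) / (cx + d)."""
    (a, b), (c, d) = A
    return Fraction(a * x + b, c * x + d)

def op_matrix(op, k):
    """Matrix of the mapping x -> x op k for the six arithmetic operations."""
    return {
        "add":  ((1, k), (0, 1)),    # x + k
        "sub":  ((1, -k), (0, 1)),   # x - k
        "mul":  ((k, 0), (0, 1)),    # x * k
        "div":  ((1, 0), (0, k)),    # x / k
        "rsub": ((-1, k), (0, 1)),   # k - x
        "rdiv": ((0, k), (1, 0)),    # k / x
    }[op]

def combine(first, second):
    """Matrix of 'first, then second': the product second * first."""
    return matmul(second, first)

ops = [("add", 3), ("rdiv", 4), ("rsub", 2), ("mul", 6)]
M = ((1, 0), (0, 1))                 # identity mapping
x = Fraction(5)
for op, k in ops:
    M = combine(M, op_matrix(op, k))
    x = apply(op_matrix(op, k), x)   # serial execution, one operation at a time
print(apply(M, Fraction(5)), x)
```

Both the combined matrix applied once and the serial execution yield the same value for the initial content 5.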

We can now efficiently support all assignments of the form x ← x θ c, where θ is an
arbitrary arithmetic operation, and c is a constant or a private variable. These assignments will
be executed atomically, while still being combined in the network. Such assignments form
a large part of the machine code in typical applications.

If one wishes to support only addition and multiplication, then it is sufficient to
consider functions of the form

x → ax + b,

which can be represented using only two coefficients. Combining two such mappings
requires two multiplications and one addition.
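The operation count can be read off directly from the composition rule. A minimal Python sketch, representing x → ax + b as the pair (a, b); the function names are hypothetical:

```python
def compose_affine(first, second):
    """(a1, b1) then (a2, b2): a2*(a1*x + b1) + b2 = (a2*a1)*x + (a2*b1 + b2)."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)   # two multiplications, one addition

def apply_affine(m, x):
    a, b = m
    return a * x + b

def fetch_and_add(k):
    return (1, k)

def fetch_and_mul(k):
    return (k, 0)

combined = compose_affine(fetch_and_add(3), fetch_and_mul(4))
print(combined, apply_affine(combined, 5))
```

For the initial value 5, adding 3 and then multiplying by 4 gives 32, which is also what the combined pair (4, 12) yields.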

Hardware  arithmetic  operations  are  not  associative.  Use  of  the  associativity  law  may 
change  occurrences  of  overflows  in  integer  arithmetic,  and  may  change  occurrences  of 
overflows,  underflows,  and  rounding  errors  in  floating  point  arithmetic.  As  our  combining 
mechanism  relies  on  associativity,  the  arithmetic  may  not  produce  the  same  results  as 
would  the  serial  order  of  the  operations.  Furthermore,  the  transformations  used  are  not 
numerically  stable  when  division  occurs;  they  are  numerically  stable  when  divisions  are  left 
out.  In  that  respect,  our  combining  mechanism  suffers  from  the  same  shortcomings  as 
compiler  optimization  techniques  that  use  transformations  based  on  algebraic  identities. 

It  is  possible  to  obtain  an  accurate  combining  mechanism  for  fixed  point  operations, 
not  including  division,  by  adding  one  extra  bit  to  the  intermediate  values,  thereby  increas- 
ing the  range  by  a  factor  of  two.  If  an  overflow  occurs  in  that  increased  range  then  an 
overflow  would  have  occurred  in  the  serial  execution  of  the  operations  in  the  restricted 
range. A similar technique of using guard bits will keep rounding errors under control
when floating point operations not involving division are combined.

5.5.   Full-Empty  Bits 

Accesses to shared variables can be synchronized using memory tags. For example,
the HEP computer uses a full-empty bit at each shared memory word [Sm]. These bits can
be used to synchronize accesses in a producer-consumer fashion. Writing may be
conditional on the location being empty; a successful write sets the (full-empty) bit. Reading
may be conditional on the location being full; a successful read may clear the (full-empty)
bit.

A  load  operation  has  the  same  effect  in  memory  as  the  corresponding  conditional  load 
operation.  We  may  therefore  assume  that  load  operations  are  always  executed  uncondi- 
tionally: a  processor  can  check  the  value  of  the  full-empty  bit  returned  by  the  load  instruc- 
tion to  determine  if  it  was  successful.  A  conditional  store  instruction  that  fails  returns  a 
negative  acknowledgement;  the  processor  may  resend  it  later. 

In  order  to  implement  this  synchronization  mechanism,  consider  the  four  memory 
access  instructions  (which  are  defined  formally  below)  that  form  the  basis  of  those  in 
tagged  memory  architectures:  load,  load-and-clear,  store-and-set,  and  store-if-clear-and- 
set. 

Let the pair (X,flag) represent the variable X and its associated full-empty bit flag.
Temporarily assume that stores are actually implemented as swaps, i.e. they return the old
value. In order to implement the instruction set as RMW instructions, one needs four
types of mappings.

(1) The identity mapping for load: (X,flag) → (X,flag).

(2) The mapping for load-and-clear: (X,flag) → (X,0).

(3) The mapping for store-and-set: (X,flag) → (v,1).

(4) The mapping for store-if-clear-and-set:

$$(X,flag) \to \begin{cases} (v,1) & \text{if } flag = 0 \\ (X,1) & \text{if } flag = 1 \end{cases}$$
To close this set of mappings under composition, two more mappings must be
included:

(5) The mapping (X,flag) → (v,0) is a store-and-clear. It implements a store-and-set
followed by a load-and-clear.

(6) The mapping

$$(X,flag) \to \begin{cases} (v,0) & \text{if } flag = 0 \\ (X,0) & \text{if } flag = 1 \end{cases}$$

is a store-if-clear-and-clear. It implements a store-if-clear-and-set followed by a
load-and-clear.
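The six mappings and the two derived compositions can be checked mechanically. In this illustrative Python sketch each mapping is a function on (X, flag) pairs; the function names mirror the instruction names but are otherwise hypothetical.

```python
def load():
    return lambda X, flag: (X, flag)

def load_and_clear():
    return lambda X, flag: (X, 0)

def store_and_set(v):
    return lambda X, flag: (v, 1)

def store_if_clear_and_set(v):
    return lambda X, flag: (v, 1) if flag == 0 else (X, 1)

def store_and_clear(v):
    return lambda X, flag: (v, 0)

def store_if_clear_and_clear(v):
    return lambda X, flag: (v, 0) if flag == 0 else (X, 0)

def then(f, g):
    """Composition: apply f to the tagged cell, then g."""
    return lambda X, flag: g(*f(X, flag))

def same(f, g, cells):
    """True if f and g agree on every listed (X, flag) cell."""
    return all(f(*c) == g(*c) for c in cells)

cells = [(x, flag) for x in (10, 20) for flag in (0, 1)]
print(same(then(store_and_set(7), load_and_clear()), store_and_clear(7), cells))
print(same(then(store_if_clear_and_set(7), load_and_clear()),
           store_if_clear_and_clear(7), cells))
```

Both comparisons agree on every cell tried, matching the descriptions of mappings (5) and (6).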

These  requests  can  now  be  combined.  The  combine  logic  is  simple.  Each  of  the  six 
types  of  instructions  can  be  encoded  by  a  short  opcode,  an  address,  and  optionally  a  data 
word. 

A store request carries one data value. A reply to a request needs to carry a data
value only if the request is a load or a combined store that contains a simple load
instruction. If these store instructions are handled specially, then the number of data values
transmitted through a combining network will never exceed the number that would have
been transmitted in a noncombining network.

There  is  a  problem  if  the  instruction  set  includes  a  standard  store  instruction,  i.e.  one 
that  does  not  change  the  full-empty  bit.  If  a  store  followed  by  a  store-if-clear-and-set  are 
to  combine,  it  cannot  be  determined  a  priori  which  store  will  actually  be  executed.  One 
solution  is  to  forward  both  store  values.  A  better  solution  is  simply  to  reverse  the  order  of 
the  requests  (to  be  the  store-if-clear-and-set  followed  by  the  store).  These  can  be  for- 
warded as  a  store-and-set  instruction. 

Reversing  the  order  does  not  always  help.  For  example,  if  the  operations  store-if- 
clear  and  store-if-set  are  combined,  both  store  values  have  to  be  forwarded.  As  we  will 
see  in  the  next  section  in  a  much  more  general  context,  even  if  we  include  all  types  of  full- 
empty  instructions,  no  request  will  ever  have  to  carry  more  than  two  store  values. 

We  assumed  in  this  section  a  busy-waiting  model  for  synchronization:  an  operation 
that  fails  returns  a  negative  acknowledgement;  the  processor  may  retry  later.  An  alterna- 
tive mechanism  is  to  queue  a  request  at  memory  until  it  is  executable.    This  decreases  the 
network traffic. However, unless some time-out mechanism is available at the memory
controller,  the  hardware  may  deadlock. 

Assume the two operations load-and-clear-if-set and store-and-set-if-clear are used to
access memory in a queueing system. Memory accesses at a location are executed in a
sequence of alternating loads and stores. Thus, a set of i loads and j stores can be combined
into |i−j| + 1 operations: stores are combined with loads, with the excess of loads or stores
staying uncombined. While combining is not guaranteed to reduce traffic in the worst case,
one can expect it will do so in the average case.

5.6.   Data-Level  Synchronization 

One  can  have  more  than  two  possible  states  (full  and  empty),  and  operations  other 
than  read  and  write  on  data.  In  a  general  data-level  synchronization  scheme,  we  have  a 
semigroup Φ of mappings representing the RMW operations that can be executed, and a
set  S  of  states.  Each  variable  is  tagged  by  its  state.  The  execution  of  an  operation  on  a 
variable  is  conditional  on  its  being  in  a  suitable  state;  the  operation  also  changes  the 
variable's  state. 

This mechanism can be represented by an automaton A = <Φ,S,δ>, where δ: S×Φ → S
is the state transition function. Assume that variable X is in state s, and an RMW(X,f)
instruction is issued. If δ(s,f) = ε (i.e. undefined) the instruction fails, and a negative
acknowledgement is returned. Otherwise, RMW(X,f) is executed, and the new state of X is
set to δ(s,f). Define the mapping f' by

$$f'((X,s)) = \begin{cases} (f(X), \delta(s,f)) & \text{if } \delta(s,f) \neq \epsilon \\ (X,s) & \text{otherwise} \end{cases}$$

Then the execution of the instruction RMW(X,f) under the control of the automaton A is
equivalent to the execution of the instruction RMW((X,s),f').
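The construction of f' from f and δ is short enough to write out. The following Python sketch is an illustration under stated assumptions: the undefined value ε is modeled as `None`, and the two-state producer/consumer automaton, together with the names `lift`, `produce`, `consume`, and `TRANSITIONS`, is a hypothetical example rather than part of the scheme itself.

```python
def lift(f, delta):
    """Build f' acting on tagged cells (X, s) from a mapping f and a transition function delta."""
    def f_prime(cell):
        X, s = cell
        new_state = delta(s, f)
        if new_state is None:          # delta(s, f) undefined: the instruction fails
            return (X, s)
        return (f(X), new_state)
    return f_prime

# Hypothetical two-state example in the spirit of full-empty bits.
def produce(x):
    return 42                          # store the constant 42

def consume(x):
    return x                           # read without modifying the value

TRANSITIONS = {("empty", produce): "full", ("full", consume): "empty"}

def delta(s, f):
    return TRANSITIONS.get((s, f))     # None plays the role of epsilon

print(lift(produce, delta)((0, "empty")))
print(lift(produce, delta)((7, "full")))
print(lift(consume, delta)((42, "full")))
```

Because f' is again an ordinary mapping on the pair (X, s), requests of this kind can be composed and combined like any other RMW mappings.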

Consider now the case where the operations executed are stores and loads. The basic
instructions are then

(1) load(X,S,δ): Load from X if state s is in S and change state to δ(s).

(2) store(X,v,S,δ): Store the value v into X if state s is in S and change state to δ(s).

For uniformity, we represent a load by the tuple (X,Ω,S,δ), where the special value Ω
represents the fact that no store is executed. A combined request then has the form <X,
(v_1,S_1,δ_1), ..., (v_k,S_k,δ_k)>, where the S_i are disjoint sets of states. The meaning of this
instruction is: if state s is in S_i then store v_i (or store nothing if v_i = Ω) and change to state
δ_i(s). If s is not in any S_i, then the instruction fails.

A  combined  instruction  that  represents  k  atomic  store  instructions  carries  at  most  k 
store  values.  Also,  a  combined  instruction  never  carries  more  than  |S|  store  values,  where 
|S|  is  the  number  of  states  of  the  controlling  automaton  A.  This  is  in  general  the  best  pos- 
sible bound:  if  there  is  an  instruction  store-if-state  =  s  for  each  state  s  of  A,  then  a  com- 
bined store  may  have  to  carry  a  distinct  store  value  for  each  state.  This  is  tractable  when 
the  number  of  states  is  small,  such  as  when  a  full-empty  bit  is  used;  it  is  not  tractable  when 
the  number  of  states  is  large.  For  example,  the  synchronization  primitives  defined  by  Zhu 
and  Yew  [ZY]  for  the  Cedar  machine  at  the  University  of  Illinois  and  by  Pier  and  Gajski 
[GP] use full word tags. With m bit tags, there are 2^m possible states, and 2^m is the best
possible  uniform  bound  on  the  number  of  store  values  in  a  combined  request. 

Memory  accesses  controlled  by  a  regular  automaton  can  be  used  to  support  simple 
path  expressions  [CH].  Path  expressions  are  used  to  synchronize  access  to  shared  objects. 
For  each  such  object  there  is  a  set  of  possible  operations  on  it.  A  regular  expression  over 
the  alphabet  consisting  of  these  operations  defines  the  language  of  legal  sequences  of 
operation  applications  on  each  object. 

A  deterministic  automaton  corresponding  to  the  path  expression  is  built.  Each  object 
is  represented  by  a  variable  in  memory,  to  which  access  is  protected  by  this  automaton. 
Each  execution  of  a  protected  operation  is  preceded  by  an  access  to  that  variable  that  per- 
forms the  corresponding  automaton  transition.  Then  the  executions  of  the  operations  are 
sequenced  according  to  the  path  expression.  The  mechanism  suggested  in  this  section 
allows  an  efficient  implementation  of  such  a  system. 

6.   RMW  and  Parallel  Prefix

This section shows the relationship between the combining mechanism presented in
this paper and a well known computational problem, prefix computation. The combining
logic turns out to be an asynchronous version of a well known parallel synchronous
algorithm. This sheds further light on performance aspects of combining.
Consider successive execution of the operations RMW(S,f_1), ..., RMW(S,f_n). These
operations return the values S, f_1(S), ..., f_{n-1}( ... (f_1(S)) ... ); the value
f_n( ... (f_1(S)) ... ) is stored in memory. Thus, execution of these instructions amounts to
the computation of S, f_1(S), ..., f_n( ... (f_1(S)) ... ) or, equivalently, to the computation
of I_S, I_S∘f_1, ..., I_S∘f_1∘ ... ∘f_n. This is a particular instance of the prefix computation
problem [LaF]: given x_1, ..., x_n, compute x_1, x_1*x_2, ..., x_1* ... *x_n, where the operation *
is an arbitrary associative operation. In our case, * is map composition.
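The correspondence can be made concrete with a prefix computation over map composition. In this illustrative Python sketch, `itertools.accumulate` computes the prefixes I_S, I_S∘f_1, ..., and the results are compared with a serial execution of the RMW operations; the helper names are hypothetical.

```python
from itertools import accumulate

def compose(f, g):
    """f o g in the convention used here: apply f first, then g."""
    return lambda x: g(f(x))

def rmw_sequence(S, fs):
    """Serial execution of RMW(S, f_1), ..., RMW(S, f_n)."""
    returned = []
    for f in fs:
        returned.append(S)
        S = f(S)
    return returned, S

def prefix_by_composition(S, fs):
    """Evaluate the prefixes I_S, I_S o f_1, ..., I_S o f_1 o ... o f_n."""
    const_S = lambda _x: S                       # the constant mapping I_S
    prefixes = accumulate(fs, compose, initial=const_S)
    return [p(None) for p in prefixes]           # the argument is irrelevant after I_S

fs = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(rmw_sequence(4, fs), prefix_by_composition(4, fs))
```

The first n prefix values are the values returned to the n requests, and the last prefix value is the final memory content.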

Prefix  computation  when  solved  in  parallel  is  known  as  parallel  prefix.  The  memory 
access  mechanism  proposed  in  this  paper  provides  in  fact  a  parallel  solution  to  the  prefix 
computation  problem.  The  computations  are  performed  on  the  nodes  of  a  tree  in  the  inter- 
connection network  that  connects  the  processors  to  one  memory  module.  In  a  multistage 
network,  in  which  processors  have  at  most  one  outstanding  request  to  each  memory  loca- 
tion, this  is  a  physical  tree,  which  is  a  subgraph  of  the  network.  In  other  cases  this  is  a  vir- 
tual tree:  operations  pertaining  to  distinct  levels  in  the  tree  are  executed  at  the  same  node 
of  the  network. 

The  problem  solved  by  the  combining  network  differs  from  parallel  prefix  in  that  the 
order  of  the  elements  combined  (with  the  exception  of  the  first)  is  arbitrary.  By  ordering 
the  operations  correctly,  one  obtains  a  distributed,  asynchronous  network  that  solves  the 
parallel  prefix  problem. 

The computation is performed on a network of processes connected as a (not
necessarily complete) binary tree with n leaves. The inputs are stored at the n leaves of the
binary tree, which correspond to the processors of the parallel computer. The root of the tree
has one parent, called superoot; it corresponds to the memory module that contains the
variable accessed; the internal nodes of the tree correspond to the combining switches in
the processor to memory interconnection network. We describe below in CSP notation
[Ho] the different types of processes.

Leaf Process

[Leaf:: val;
    parent ! val;
    parent ? val
]

Internal Node Process

[Node:: lval, rval, pval;
    left_child ? lval;
    right_child ? rval;
    parent ! lval*rval;
    parent ? pval;
    left_child ! pval;
    right_child ! pval*lval
]

Superoot Process

[Superoot:: val;
    child ? val;
    child ! id
]

Let val_i be the initial value at the i-th leaf. At the end of the computation the value at
the i-th leaf equals val_1* ... *val_{i-1}; the value at the superoot equals val_1* ... *val_n.
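The behavior of the three process types can be imitated sequentially. The Python sketch below is not a CSP implementation; it replaces message passing by two recursive sweeps (an assumption made for simplicity) and uses hypothetical names. A tree is either a leaf value or a pair (left, right), and string concatenation serves as a noncommutative associative operation so that the ordering is visible.

```python
def sweep_up(tree, op):
    """Upward phase: each node sends lval*rval to its parent and remembers lval."""
    if not isinstance(tree, tuple):
        return tree, None                       # leaf: parent ! val
    (ltot, lann), (rtot, rann) = sweep_up(tree[0], op), sweep_up(tree[1], op)
    return op(ltot, rtot), (ltot, lann, rann)

def sweep_down(ann, pval, op, out):
    """Downward phase: left child gets pval, right child gets pval*lval."""
    if ann is None:
        out.append(pval)                        # leaf: parent ? val
        return
    lval, lann, rann = ann
    sweep_down(lann, pval, op, out)
    sweep_down(rann, op(pval, lval), op, out)

def tree_prefix(tree, op, identity):
    total, ann = sweep_up(tree, op)             # value received by the superoot
    out = []
    sweep_down(ann, identity, op, out)          # the superoot replies with id
    return out, total

tree = (("a", "b"), ("c", ("d", "e")))          # a tree that is not complete
print(tree_prefix(tree, lambda x, y: x + y, ""))
```

For the leaves a, b, c, d, e the leaves end up with "", "a", "ab", "abc", "abcd", and the superoot receives "abcde".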

If the tree is complete, then the operations performed by this tree are exactly the same
operations performed by the Ladner-Fischer parallel prefix network [LaF]. The global
clock synchronization used by their algorithm is replaced by local data-flow
synchronization. Each internal node performs two multiplications, of which ⌈lg n⌉
(those by the identity, along the leftmost path of the tree) are trivial. Thus,
2n - 2 - ⌈lg n⌉ nontrivial multiplications are done. The algorithm can be implemented to
run in 2⌈lg n⌉ - 2 multiplication cycles, when globally synchronized.
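The multiplication count can be checked by walking the tree. The sketch below assumes that a multiplication is "trivial" exactly when its left operand is the identity sent by the superoot, which happens only at the nodes on the leftmost root-to-leaf path; the function name count_multiplications and the midpoint split are hypothetical choices for illustration. For n a power of two the tree is complete and the leftmost path contains lg n internal nodes.

```python
from typing import Tuple


def count_multiplications(n: int) -> Tuple[int, int]:
    """Return (total, trivial) multiplications for a balanced tree with n leaves.

    Each internal node computes lval*rval and pval*lval. The second product is
    counted as trivial when pval is the identity (an assumption of this sketch).
    """

    def visit(lo: int, hi: int, pval_is_id: bool) -> Tuple[int, int]:
        if hi - lo == 1:
            return 0, 0                      # leaves do not multiply
        mid = (lo + hi) // 2
        total, trivial = 2, (1 if pval_is_id else 0)
        # The left child receives pval unchanged, so it inherits the identity.
        lt, ltriv = visit(lo, mid, pval_is_id)
        # The right child receives pval*lval, which is never the identity here.
        rt, rtriv = visit(mid, hi, False)
        return total + lt + rt, trivial + ltriv + rtriv

    return visit(0, n, True)
```

For n = 8 this gives 14 multiplications in total, 3 of them trivial, leaving 11 = 2n - 2 - lg n nontrivial ones.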

7.   Conclusion 

This paper provides and exemplifies a formal method for reasoning about the
correctness of parallel computer architectures. It provides the theoretical underpinnings
of the combining mechanism used by the NYU Ultracomputer and RP3. It presents a
general formulation of RMW operations and a general mechanism to efficiently support
such operations.

A significant amount of supplementary hardware is required to combine RMW
operations. Each switch needs logic that is able to compute mapping compositions and
mapping applications; extra logic is also required at the memory module. The switches
also need an associative store to hold information on combined requests.

The need for associative retrieval at the switches can be avoided at the cost of
more expensive labeling schemes. An implementation of an efficient switch that supports
combining of fetch-and-add requests is described in [DKS], [DKSS]. This switch has been
partially implemented. The same scheme can be used for other RMW operations.

Note  that  one  can  use  combining  logic  that  detects  only  part  of  the  combinable  pairs. 
Memory  accesses  are  correctly  performed  even  with  partial  combining,  or  no  combining  at 
all.   Thus,  different  cost-performance  tradeoffs  are  possible. 
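The claim that partial combining leaves the results unchanged can be illustrated with fetch-and-add at a single switch. This is only a sketch under simplifying assumptions: one switch in front of one memory cell, combining restricted to adjacent pairs of requests, and a random choice standing in for whatever the combining logic happens to detect. The name fetch_and_add_round and its parameters are hypothetical.

```python
import random
from typing import List, Tuple


def fetch_and_add_round(
    mem: int, incs: List[int], combine_prob: float, rng: random.Random
) -> Tuple[int, List[int]]:
    """Serve fetch-and-add(incs[i]) requests, combining some adjacent pairs.

    Returns (final memory value, reply delivered to each request).
    """
    replies = [0] * len(incs)
    i = 0
    while i < len(incs):
        if i + 1 < len(incs) and rng.random() < combine_prob:
            # Combine: forward the sum, remember incs[i] at the switch.
            old = mem
            mem += incs[i] + incs[i + 1]
            # Decombine: first request sees old, second sees old + incs[i].
            replies[i] = old
            replies[i + 1] = old + incs[i]
            i += 2
        else:
            # Not combined: the request is served directly by memory.
            replies[i] = mem
            mem += incs[i]
            i += 1
    return mem, replies
```

With mem = 10 and incs = [1, 2, 3, 4], the replies are [10, 11, 13, 16] and the final memory value is 20 whether combine_prob is 0.0 (no combining), 1.0 (every adjacent pair combined), or anything in between.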

Combining or partial combining can be used on a wide variety of interconnection
networks. The only major restriction is that the reply to a request must return via the
same route as the request (although in the reverse direction). Thus, the mechanisms
described in this paper can be easily adapted for use by direct connection machines,
such as the Cosmic Cube [Se], where the processors themselves act like network switches
and the local memories at each node are all viewed as part of a distributed, shared memory.